SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
Haas, Lukas, Yona, Gal, D'Antonio, Giovanni, Goldshtein, Sasha, Das, Dipanjan
We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- South America > Colombia (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- (7 more...)
- Leisure & Entertainment (1.00)
- Government (0.69)
- Media > Television (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)
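The headline number above is an F1-score over short-form factuality grades. As a rough sketch (the exact aggregation and grading rubric are defined by the benchmark's autorater, and the helper name here is my own), an F1 of this kind can be computed as the harmonic mean of overall accuracy and accuracy on attempted questions, which rewards models for declining to answer rather than hallucinating:

```python
def simpleqa_f1(n_correct: int, n_incorrect: int, n_not_attempted: int) -> float:
    """Harmonic mean of overall accuracy and accuracy-given-attempted.

    A sketch of the F1 aggregation used with SimpleQA-style benchmarks;
    the benchmark's own autorater defines the authoritative rubric.
    """
    total = n_correct + n_incorrect + n_not_attempted
    attempted = n_correct + n_incorrect
    if total == 0 or attempted == 0 or n_correct == 0:
        return 0.0
    overall = n_correct / total              # correct over all prompts
    given_attempted = n_correct / attempted  # correct when the model answered
    return 2 * overall * given_attempted / (overall + given_attempted)

# Illustrative tallies over a 1,000-prompt set (numbers are invented):
score = simpleqa_f1(556, 300, 144)
```

Under this aggregation, a model that answers everything correctly scores 1.0, while attempting nothing scores 0.0 regardless of caution.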
mwBTFreddy: A Dataset for Flash Flood Damage Assessment in Urban Malawi
Chapuma, Evelyn, Mengezi, Grey, Msasa, Lewis, Taylor, Amelia
This paper describes the mwBTFreddy dataset, a resource developed to support flash flood damage assessment in urban Malawi, specifically focusing on the impacts of Cyclone Freddy in 2023. The dataset comprises paired pre- and post-disaster satellite images sourced from Google Earth Pro, accompanied by JSON files containing labelled building annotations with geographic coordinates and damage levels (no damage, minor, major, or destroyed). Developed by the Kuyesera AI Lab at the Malawi University of Business and Applied Sciences, this dataset is intended to facilitate the development of machine learning models tailored to building detection and damage classification in African urban contexts. It also supports flood damage visualisation and spatial analysis to inform decisions on relocation, infrastructure planning, and emergency response in climate-vulnerable regions.
- Africa > Malawi > Southern Region > Blantyre District > Blantyre (0.05)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Africa > Southern Africa (0.04)
- (4 more...)
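The dataset pairs imagery with JSON files of per-building annotations. A minimal sketch of consuming such a file, assuming a hypothetical schema (the field names below are illustrative, not the dataset's documented format):

```python
import json
from collections import Counter

# Hypothetical annotation record; field names are illustrative only.
sample = '''
{
  "image_pair": {"pre": "pre_001.png", "post": "post_001.png"},
  "buildings": [
    {"lat": -15.786, "lon": 35.005, "damage": "destroyed"},
    {"lat": -15.787, "lon": 35.007, "damage": "minor"},
    {"lat": -15.789, "lon": 35.009, "damage": "no damage"}
  ]
}
'''

def damage_summary(annotation_json: str) -> Counter:
    """Tally buildings per damage level from one annotation file."""
    record = json.loads(annotation_json)
    return Counter(b["damage"] for b in record["buildings"])

counts = damage_summary(sample)
```

Summaries like this feed directly into the spatial analysis and damage visualisation uses the abstract describes.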
Using Machine Learning to Detect Fraudulent SMSs in Chichewa
SMS-enabled fraud is of great concern globally. Building machine learning classifiers for SMS fraud requires suitable datasets for model training and validation. Most research has centred on datasets of SMSs in English. This paper introduces a first dataset for SMS fraud detection in Chichewa, a major language in Africa, and reports on experiments with machine learning algorithms for classifying SMSs in Chichewa as fraud or non-fraud. We answer the broader research question of how feasible it is to develop machine learning classification models for Chichewa SMSs. To do that, we created three datasets. A small dataset of SMSs in Chichewa was collected through primary research from a segment of the young population. We applied label-preserving text transformations to increase its size. The enlarged dataset was translated into English using two approaches: human translation and machine translation. The Chichewa and the translated datasets were then classified using random forest and logistic regression. Our findings indicate that both models achieved a promising accuracy of over 96% on the Chichewa dataset. Performance dropped when moving from the Chichewa to the translated dataset. This highlights the importance of data preprocessing, especially in multilingual or cross-lingual NLP tasks, and shows the challenges of relying on machine-translated text for training machine learning models. Our results underscore the importance of developing language-specific models for SMS fraud detection to optimise accuracy and performance. Since most machine learning models require data preprocessing, it is essential to investigate the impact of relying on English-specific tools for data preprocessing.
- Africa > South Africa > Gauteng > Johannesburg (0.04)
- Africa > Kenya (0.04)
- Asia > Pakistan (0.04)
- (12 more...)
- Telecommunications (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- Law Enforcement & Public Safety > Fraud (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
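The abstract mentions label-preserving text transformations used to enlarge the small collected dataset. One simple transformation family of that kind is swapping adjacent words, which perturbs surface form without changing the fraud/non-fraud label; the paper's exact transformation set may differ, and the function below is a sketch:

```python
import random

def augment_sms(text: str, n_variants: int = 2, seed: int = 0) -> list[str]:
    """Generate label-preserving variants of an SMS by swapping one pair
    of adjacent words per variant. Illustrative only; the paper's actual
    transformations are not specified here."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    words = text.split()
    variants = []
    for _ in range(n_variants):
        w = words[:]
        if len(w) > 1:
            i = rng.randrange(len(w) - 1)
            w[i], w[i + 1] = w[i + 1], w[i]  # swap a neighbouring pair
        variants.append(" ".join(w))
    return variants

variants = augment_sms("landirani mphatso yanu lero", n_variants=2)
```

Because each variant is a permutation of the original words, the message's meaning (and hence its label) is largely preserved while the classifier sees new token orderings.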
HumT DumT: Measuring and controlling human-like language in LLMs
Cheng, Myra, Yu, Sunny, Jurafsky, Dan
Should LLMs generate language that makes them seem human? Human-like language might improve user experience, but might also lead to overreliance and stereotyping. Assessing these potential impacts requires a systematic way to measure human-like tone in LLM outputs. We introduce HumT and SocioT, metrics for human-like tone and other dimensions of social perceptions in text data based on relative probabilities from an LLM. By measuring HumT across preference and usage datasets, we find that users prefer less human-like outputs from LLMs. HumT also offers insights into the impacts of anthropomorphism: human-like LLM outputs are highly correlated with warmth, social closeness, femininity, and low status, which are closely linked to the aforementioned harms. We introduce DumT, a method using HumT to systematically control and reduce the degree of human-like tone while preserving model performance. DumT offers a practical approach for mitigating risks associated with anthropomorphic language generation.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Austria > Vienna (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (14 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.69)
- Health & Medicine > Therapeutic Area (0.68)
- Government (0.68)
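HumT and SocioT are built on relative probabilities from an LM. As a heavily simplified sketch (the paper's exact formulation may differ, and the function name is my own), one can score a text's human-like tone as the mean per-token log-probability gap between two conditional framings:

```python
def relative_tone_score(logp_human: list[float], logp_neutral: list[float]) -> float:
    """Sketch of a HumT-style relative-probability metric: mean per-token
    log-probability of the text under a 'human-like' framing minus the same
    under a neutral framing. Positive => reads as more human-like.
    The log-prob lists would come from any LM scoring API; the numbers
    used below are invented."""
    assert len(logp_human) == len(logp_neutral)
    diffs = [h - n for h, n in zip(logp_human, logp_neutral)]
    return sum(diffs) / len(diffs)

# Toy example with made-up per-token log-probs for a 2-token text:
score = relative_tone_score([-1.0, -2.0], [-2.0, -3.0])
```

A score computed this way is relative, which is what lets DumT use it as a control signal to steer generations toward lower human-likeness without an absolute threshold.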
How AI monitoring is cutting stillbirths and neonatal deaths in a clinic in Malawi
When Ellen Kaphamtengo felt a sharp pain in her lower abdomen, she thought she might be in labour. It was the ninth month of her first pregnancy and she wasn't taking any chances. With the help of her mother, the 18-year-old climbed on to a motorcycle taxi and rushed to a hospital in Malawi's capital, Lilongwe, a 20-minute ride away. At the Area 25 health centre, they told her it was a false alarm and took her to the maternity ward. But things escalated quickly when a routine ultrasound revealed that her baby was much smaller than expected for her pregnancy stage, which can cause asphyxia – a condition that limits blood flow and oxygen to the baby.
- Africa > Malawi > Central Region > Lilongwe District > Lilongwe (0.25)
- North America > United States > Texas (0.07)
SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
Yao, Zijun, Qi, Weijian, Pan, Liangming, Cao, Shulin, Hu, Linmei, Liu, Weichuan, Hou, Lei, Li, Juanzi
This paper introduces Self-aware Knowledge Retrieval (SeaKR), a novel adaptive RAG model that extracts the self-aware uncertainty of LLMs from their internal states. SeaKR activates retrieval when the LLM presents high self-aware uncertainty during generation. To effectively integrate retrieved knowledge snippets, SeaKR re-ranks them based on the LLM's self-aware uncertainty, preserving the snippet that reduces uncertainty the most. To facilitate solving complex tasks that require multiple retrievals, SeaKR uses this self-aware uncertainty to choose among different reasoning strategies. Our experiments on both complex and simple Question Answering datasets show that SeaKR outperforms existing adaptive RAG methods. We release our code at https://github.com/THU-KEG/SeaKR.
- Africa > Tanzania > Dar es Salaam Region > Dar es Salaam (0.05)
- Africa > Kenya > Nairobi Province (0.04)
- Africa > Kenya > Nairobi City County > Nairobi (0.04)
- (37 more...)
- Media > Film (1.00)
- Media > Television (0.68)
- Media > Music (0.68)
- (4 more...)
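The control flow described in the abstract — generate, check self-aware uncertainty, retrieve and re-rank only when uncertainty is high — can be sketched in a few lines. All callables below are stand-ins for model internals; the real method extracts uncertainty from LLM hidden states rather than from a user-supplied function:

```python
def seakr_style_generate(question, generate, uncertainty, retrieve, threshold=0.5):
    """Sketch of uncertainty-gated adaptive RAG in the spirit of SeaKR.

    generate(question, context)   -> answer string
    uncertainty(question, answer) -> float in [0, 1], higher = less confident
    retrieve(question)            -> list of candidate snippets
    """
    answer = generate(question, context=None)
    if uncertainty(question, answer) <= threshold:
        return answer  # confident enough: skip retrieval entirely
    snippets = retrieve(question)
    # Re-rank: keep the snippet that reduces self-aware uncertainty the most.
    best = min(snippets,
               key=lambda s: uncertainty(question, generate(question, context=s)))
    return generate(question, context=best)

# Toy stand-ins to exercise the control flow (purely illustrative):
def toy_generate(q, context=None):
    return f"{q} -> {context if context else 'draft'}"

def toy_uncertainty(q, answer):
    return 0.1 if "rainfall records" in answer else 0.9

answer = seakr_style_generate(
    "When did the flood peak?",
    toy_generate,
    toy_uncertainty,
    lambda q: ["old news article", "rainfall records"],
)
```

The gating is what makes the method adaptive: confident generations pay no retrieval cost, and retrieval quality is judged by its effect on the model's own uncertainty rather than by lexical overlap.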
The AI revolution comes for farmers growing a third of our food
In the village of Ndodo, 40 kilometers south of the Malawian capital Lilongwe, farmers gather in the shade of an acacia tree as a voice over a smartphone tells them how to get rid of a weevil that's destroying their sweet potato crops. The tips, offered by the app in the local language Chichewa, are among the first examples of how artificial intelligence is being used to aid subsistence farmers in some of the poorest parts of the world. Piloted by Opportunity International, a Chicago-based nonprofit organization, the app, called Ulangizi -- which translates as "Advice" -- works on WhatsApp and uses data from ChatGPT and the Malawian government's English-language agricultural manual to answer questions or diagnose crop and farm animal diseases. "The majority of our people do not know how to read or write," said Anna Chimalizeni, a 36-year-old mother of three, who as a government farmer-support agent demonstrates the app to farmers. "I am there to help them write issues they have at their farms and read the response on their behalf. They also have a chance to listen to the response through voice notes which come in our own local language."
- North America > United States > Illinois > Cook County > Chicago (0.29)
- Africa > Malawi > Central Region > Lilongwe District > Lilongwe (0.29)
Development of Semantics-Based Distributed Middleware for Heterogeneous Data Integration and its Application for Drought
Drought is a complex environmental phenomenon that affects millions of people and communities all over the globe and remains too elusive to be accurately predicted. This is mostly due to the scale and variability of the web of environmental parameters that directly or indirectly cause the onset of different categories of drought. Since the dawn of man, efforts have been made to understand the natural indicators that provide signs of likely environmental events. These indicators, in the form of indigenous knowledge systems, have been used for generations. The intricate complexity of drought has, however, always been a major stumbling block for accurate drought prediction and forecasting systems. Recently, scientists in the fields of agriculture and environmental monitoring have been discussing the integration of indigenous knowledge and scientific knowledge into more accurate environmental forecasting systems, in order to incorporate diverse environmental information for a reliable drought forecast. Hence, the core objective of this research is the development of a semantics-based data integration middleware that encompasses and integrates heterogeneous data models of local indigenous knowledge and sensor data towards an accurate drought forecasting system for the study areas. The local indigenous knowledge on drought gathered from domain experts is transformed into rules used to perform deductive inference, in conjunction with sensor data, to determine the onset of drought through an automated inference generation module of the middleware. The semantic middleware incorporates, inter alia, a distributed architecture that consists of a streaming data processing engine based on Apache Kafka for real-time stream processing, a rule-based reasoning module, and an ontology module for semantic representation of the knowledge bases.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.13)
- Africa > Sub-Saharan Africa (0.04)
- (50 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Personal (0.92)
- Health & Medicine (1.00)
- Government (1.00)
- Food & Agriculture > Agriculture (1.00)
- (3 more...)
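The rule-based reasoning step — indigenous knowledge encoded as rules, fired against sensor-derived facts — is classic forward chaining. A minimal sketch (the rule contents below are invented for illustration, not taken from the paper):

```python
# Each rule: (set of antecedent facts, conclusion). Rule contents are
# illustrative stand-ins for indigenous-knowledge indicators.
RULES = [
    ({"low_rainfall", "early_acacia_flowering"}, "drought_onset_likely"),
    ({"soil_moisture_below_threshold", "drought_onset_likely"}, "issue_drought_alert"),
]

def infer(facts: set) -> set:
    """Forward-chain over RULES until no new conclusions can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, conclusion in RULES:
            if antecedents <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived
```

In the middleware, the antecedent facts would arrive from the Kafka stream-processing engine, and the derived conclusions would drive the drought forecast; chaining lets a conclusion from one rule (here, `drought_onset_likely`) satisfy the antecedents of another.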
Flickr Africa: Examining Geo-Diversity in Large-Scale, Human-Centric Visual Data
Naggita, Keziah, LaChance, Julienne, Xiang, Alice
Biases in large-scale image datasets are known to influence the performance of computer vision models as a function of geographic context. To investigate the limitations of standard Internet data collection methods in low- and middle-income countries, we analyze human-centric image geo-diversity on a massive scale using geotagged Flickr images associated with each nation in Africa. We report the quantity and content of available data with comparisons to population-matched nations in Europe as well as the distribution of data according to fine-grained intra-national wealth estimates. Temporal analyses are performed at two-year intervals to expose emerging data trends. Furthermore, we present findings for an "othering" phenomenon as evidenced by a substantial number of images from Africa being taken by non-local photographers. The results of our study suggest that further work is required to capture image data representative of African people and their environments and, ultimately, to improve the applicability of computer vision models in a global context.
- Asia > Brunei (0.14)
- North America > Canada > Quebec > Montreal (0.06)
- Africa > Sierra Leone (0.06)
- (142 more...)
- Health & Medicine (0.92)
- Information Technology > Services (0.75)
- Government > Regional Government (0.46)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
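The population-matched comparison the abstract describes amounts to normalizing image counts by population and comparing the resulting rates across a matched pair of countries. A toy sketch (country names and all numbers below are invented, not the study's data):

```python
def images_per_capita(image_counts: dict, populations: dict) -> dict:
    """Geotagged images per person, per country."""
    return {c: image_counts[c] / populations[c] for c in image_counts}

# Hypothetical population-matched pair; counts and populations are invented.
counts = {"CountryA": 1_200, "CountryB": 240_000}
pops = {"CountryA": 10_000_000, "CountryB": 10_000_000}

rates = images_per_capita(counts, pops)
coverage_gap = rates["CountryB"] / rates["CountryA"]  # times more data per person
```

Gaps of this kind are what make models trained on such data perform unevenly across geographic contexts.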
Exploring the Benefits of Training Expert Language Models over Instruction Tuning
Jang, Joel, Kim, Seungone, Ye, Seonghyeon, Kim, Doyoung, Logeswaran, Lajanugen, Lee, Moontae, Lee, Kyungjae, Seo, Minjoon
Recently, Language Models (LMs) instruction-tuned on multiple tasks, also known as multitask-prompted fine-tuning (MT), have shown the capability to generalize to unseen tasks. Previous work has shown that scaling the number of training tasks is the key component in making stronger MT LMs. In this work, we report an unexpected finding that an expert LM fine-tuned on just a single task can outperform an MT LM trained with 300+ different tasks on 11 different unseen datasets and on 13 datasets of the BIG-bench benchmark by a mean accuracy of 3.20% and 1.29%, respectively. This finding casts doubt on the previously held belief that simply scaling the number of tasks makes stronger MT LMs. Leveraging this finding, we further show that this distributed approach of training a separate expert LM per training task instead of a single MT LM for zero-shot inference possesses many benefits including (1) avoiding negative task transfer that often occurs during instruction tuning, (2) being able to continually learn new tasks without having to re-train on previous tasks to avoid catastrophic forgetting, and (3) showing compositional capabilities when merging individual experts together. The code is available at https://github.com/joeljang/ELM.
- Europe > France (0.28)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Italy > Tuscany > Florence (0.04)
- (24 more...)
- Research Report (1.00)
- Overview (0.93)
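One of the compositional benefits listed above is merging individual experts. A common simple scheme for this is uniform parameter averaging; the sketch below uses plain lists of floats as stand-ins for real weight tensors, and the paper's exact merging method may differ:

```python
def merge_experts(expert_weights: list[dict]) -> dict:
    """Uniformly average per-task expert LMs, parameter by parameter.

    Each expert is a {parameter_name: list_of_floats} mapping standing in
    for real tensors. Assumes all experts share the same architecture
    (same names and shapes).
    """
    n = len(expert_weights)
    merged = {}
    for name in expert_weights[0]:
        columns = zip(*(expert[name] for expert in expert_weights))
        merged[name] = [sum(vals) / n for vals in columns]
    return merged

# Two toy "experts" with a single 2-element parameter each:
merged = merge_experts([{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}])
```

Because each expert was trained in isolation, averaging sidesteps the negative task transfer that joint multitask training can suffer, at the cost of assuming the experts' parameters live in compatible regions of weight space.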